Who am I to tell you things?

Bioinformatics, statistics, computational biology

Things bioinformaticians care about

  • The biological question
  • Statistics
  • Experimental design
  • Reproducibility
  • File formats

Statistics

Statistics matters

What is a p-value?

\(H_0\): The null hypothesis, no effect

\(H_1\): The alternative hypothesis, there is an effect

We run a test, we get a p-value. What is it?

  • Probability that \(H_0\) is true, given the data
  • Probability that \(H_1\) is wrong, given the data
  • Probability that the data is random
  • Probability of observing the data, given \(H_0\) is true
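(The last answer is the correct one: the \(p\)-value is the probability of observing data at least as extreme as ours, given that \(H_0\) is true.) A minimal sketch of this definition, using a hypothetical coin-flip example not taken from the slides:

```python
from math import comb

# Hypothetical example: 20 coin flips, 15 heads. H0: the coin is fair.
# The p-value is the probability, under H0, of a result at least as
# extreme as the one we observed.
n, heads = 20, 15
p_one_tail = sum(comb(n, k) for k in range(heads, n + 1)) / 2**n
p_value = 2 * p_one_tail  # two-sided: 15 or more heads, or 5 or fewer

print(round(p_value, 4))  # ≈ 0.0414
```

Note that this says nothing about the probability that the coin is fair — only how surprising the data would be if it were.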

Our intuition is Bayesian, not frequentist

| Frequentist Statistics | Bayesian Statistics |
|------------------------|---------------------|
| 1. Probability is defined as the long-run frequency of events. | 1. Probability represents a degree of belief or certainty about an event. |
| 2. Parameters (like the “true value”) are fixed but unknown quantities. | 2. Parameters are treated as random variables with their own probability distributions. |
| 3. Asking about the probability of a hypothesis does not make sense. | 3. Asking about the probability of a hypothesis is the main goal. |

Going beyond the p-value

  • Confidence intervals
  • Effect sizes
  • Power analysis
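As an example of an effect size, Cohen's \(d\) expresses a group difference in units of the pooled standard deviation; a minimal sketch with made-up values (not real data):

```python
from statistics import mean, stdev

def cohens_d(x, y):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2)
    return (mean(x) - mean(y)) / pooled_var ** 0.5

# Made-up measurements for two groups:
control   = [4.1, 3.8, 4.5, 4.0, 4.2]
treatment = [5.0, 4.7, 5.4, 4.9, 5.2]
print(round(cohens_d(treatment, control), 2))
```

Unlike a \(p\)-value, \(d\) does not shrink or grow just because you collected more samples — it describes how large the difference actually is.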

Why is that important?

P-values are the language of science, whether we like them (we don’t) or not.

  • Use effect sizes always
  • Never rely on p-values alone

Tip

You have to understand p-values and their limits to talk to other scientists!

Experimental design

How many samples are sufficient?

  • Depends on the question
  • Depends on the technology
  • Depends on the variability

How many samples are sufficient?

Say we want to compare two groups with a standard \(t\)-test, nothing fancy. Our ability to detect the difference (the statistical power) depends on the sample size and the effect size.

How many samples are sufficient?

The \(y\) axis of this plot shows the power of the test – that is, how often, assuming that the groups really differ by \(d\) on average, you will be able to detect the difference using a \(t\)-test.
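The curve behind such a plot can be sketched with the normal approximation to the two-sample \(t\)-test (a simplification; the exact calculation uses the noncentral \(t\) distribution):

```python
from math import sqrt
from statistics import NormalDist

def approx_power(d, n, alpha=0.05):
    """Approximate power of a two-sided two-sample test with n samples per
    group and a true standardized difference d (normal approximation)."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    return nd.cdf(abs(d) * sqrt(n / 2) - z_crit)

# With a medium effect size (d = 0.5), roughly 64 samples per group
# are needed to reach ~80% power:
for n in (10, 30, 64, 100):
    print(n, round(approx_power(0.5, n), 2))
```

Note how quickly power drops for small groups: with 10 samples per group and a medium effect, you detect the difference only about one time in five.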

How many samples are sufficient?

What about the following setup:

  • We have 2 strains (WT and KO)
  • We have treatment + control
  • We want to know whether the treatment has a different effect on the KO strain than on the WT strain

This is a 2x2 design, and we need to consider the interaction term.
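What the interaction term measures can be written down directly; here is a sketch with made-up group means (not real data):

```python
# 2x2 design: the interaction asks whether the treatment effect
# differs between the strains. With hypothetical group means:
means = {
    ("WT", "control"):   4.0,
    ("WT", "treatment"): 6.0,   # treatment effect in WT: +2.0
    ("KO", "control"):   4.0,
    ("KO", "treatment"): 7.5,   # treatment effect in KO: +3.5
}

effect_wt = means[("WT", "treatment")] - means[("WT", "control")]
effect_ko = means[("KO", "treatment")] - means[("KO", "control")]
interaction = effect_ko - effect_wt  # the quantity the 2x2 design tests

print(interaction)  # 1.5: the treatment acts differently in the KO strain
```

Because the interaction is a difference of differences, its estimate is noisier than either main effect — which is why interaction designs need considerably more samples.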

How many samples are sufficient?

That is not even the worst thing.

Simple calculations show that assuming

  • your power is 80% (really great!)
  • \(p\)-value cutoff is \(0.05\)
  • 90% of the \(H_0\) are true (i.e., 10% of the time the differences are real)

then 36% of your “significant” results are false positives!

(Plus, you failed to detect 20% of the real differences)
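These percentages follow from a short calculation:

```python
# Of all tests, 10% have a real effect (H1) and 90% do not (H0).
power, alpha, frac_real = 0.80, 0.05, 0.10

true_positives  = frac_real * power         # real effects we detect
false_positives = (1 - frac_real) * alpha   # H0 cases crossing p < 0.05
significant     = true_positives + false_positives

fdr = false_positives / significant  # fraction of "hits" that are false
print(round(fdr, 2))  # 0.36
```

In other words, even with excellent power and a standard cutoff, more than a third of the significant results are spurious whenever real effects are rare.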

The bottom line

Talk to your bioinformatician early!

Reproducibility

Tale of two papers


Lessons learned

  • A lot depends on how you analyze your data
  • This in turn depends on the questions you ask
  • The average “Methods” section is not sufficient for reproducible science!

Reproducible workflows with Rmarkdown

flowchart LR
    A(Program + Text) -->|knitr| B(Text with\nanalysis results)
    B --> C[LaTeX]
    C --> CC[PDF]
    B --> D[Word]
    B --> E[HTML]
    B --> F[Presentation]
    B --> G[Book]

This can be Rmarkdown, Quarto, Jupyter… the goal is that your code and your text are in one place, and the results of your calculations are entered automatically into the text.

Reproducible workflows with Rmarkdown

In systems such as R markdown, you can put your analysis results directly into your text. For example, when I write that the \(p\)-value is equal to 0.05, I am writing this:

In systems such as R markdown, you can put your analysis
results directly into your text. For example, when I write that the
$p$-value is equal to `​r p`, I am writing this:

The \(p\)-value above is not entered manually (as 0.05) but is the result of a statistical computation. If the data or the analysis changes, the \(p\)-value above will automatically change as well.

File formats and data management

How we work

flowchart LR
    A(Excel) --> B(Data import)
    AA(CSV, TSV) --> B(Data import)
    AAA(fastq, ...) --> B(Data import)
    B --> C[Data\ncleanup]
    C --> D[Long term storage]
    C --> E[Analysis]
    E --> D
    E --> F(Figures)
    E --> G(Manuscript\nfragments)
    E --> H(Tables\nExcel files)
    F --> I[You]
    G --> I
    H --> I
    I --> E

In the diagram above, two steps usually take the most hands-on time:

  • Data cleanup
  • Fine-tuning the analysis results

Identifiers

Excel and gene names

  • Excel converts some words to dates automatically
  • Gene names like MARCH1 are converted to dates
  • In most cases, you can’t switch off this behavior

How (not to) work with Excel

Three reasons why you should follow these rules:

  1. Fewer chances of errors
  2. Your bioinformaticians will love you
  3. The analysis will be done much faster

How (not to) work with Excel

Avoid manually changing Excel files

  • Manual changes cannot be tracked automatically
  • You have to record every change you make
  • Otherwise, this is not reproducible science!

How (not to) work with Excel

Never use formatting for data

Never encode information as formatting, always use explicit columns

Color / font size / font style cannot be read automatically

How (not to) work with Excel

Don’t combine values and comments

Make a separate column for comments

Otherwise the values might be lost

How (not to) work with Excel

Don’t put meta-information into column names

Make a separate Excel sheet for column meta-information

How (not to) work with Excel

(for your reference)

  • Avoid manually changing Excel files
  • Never use formatting for data
  • Don’t combine values and comments
  • Don’t put meta-information into column names
  • One sheet = one table
  • Header = one line
  • Do not use merged cells
  • Use consistent file names
  • Avoid spaces in file and column names (use underscores)

Some more tips and summaries

Things we don’t like

  • Cleaning up data
  • Data dredging
  • P-hacking
  • Post-hoc hypotheses
  • Excel
  • Manual changes like changing fonts in figures
  • Non-reproducible science

Things we love

  • Clear questions
  • A priori hypotheses
  • Challenging statistics
  • Creating new tools
  • R and Rmarkdown, or
  • Python and Jupyter
  • Reproducible workflows
  • Well organized data

Things that you should probably learn

  • Learn how to code (preferably R or Python)
  • Learn reproducible workflows with Rmarkdown or Jupyter

Thank you

You can find this presentation along with its source code at https://github.com/bihealth/howtotalk